Explore and Summarize Data by Mengtong Shen

Univariate Plots Section

To make the quality rating more readable, I assigned each wine one of three quality levels based on its quality score: ‘poor’ for a score below 5, ‘average’ for a score of 5 or 6, and ‘high’ for a score of 7 or above.
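The chunk that created `quality_lv` is not shown, so the following is only a minimal sketch of how such a binning could be done with base R's `cut()`. The helper name `to_quality_level` and the break points are assumptions; the breaks were chosen to be consistent with the group sizes that `summary(WINE)` reports (63 poor, 1319 average, 217 high).

```r
# Hypothetical helper: map integer quality scores to three ordered levels.
# Intervals are right-closed: (0,4] -> poor, (4,6] -> average, (6,10] -> high.
to_quality_level <- function(quality) {
  cut(quality,
      breaks = c(0, 4, 6, 10),
      labels = c("poor", "average", "high"),
      ordered_result = TRUE)
}

# Example on the six scores that occur in the dataset (3 to 8)
to_quality_level(3:8)
```

On the full data this would be applied as `WINE$quality_lv <- to_quality_level(WINE$quality)`.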

Explore the distribution of all the features for the wine.

uni_qplot <- function(variable1, variable2){
  ggplot(data=WINE,aes_q(as.name(variable1)))+
    geom_histogram(binwidth=variable2)+
    ggtitle(variable1)
}
uni_qplot('fixed.acidity',0.2)+
  scale_x_continuous(breaks = 4:16)

uni_qplot('volatile.acidity',0.02)+
  scale_x_continuous(breaks = seq(0,1.6,0.1))

uni_qplot('citric.acid',0.05)+
  scale_x_continuous(breaks = seq(0,1,0.1))

uni_qplot('residual.sugar',0.2)+
  scale_x_continuous(breaks = seq(0,10,1))+
  coord_cartesian(xlim = c(0,10))

uni_qplot('chlorides',0.01)+
  scale_x_continuous(breaks = seq(0,0.2,0.05))+
  coord_cartesian(xlim = c(0,0.2))

uni_qplot('free.sulfur.dioxide',2)+
  scale_x_continuous(breaks = seq(0,70,5))+
  coord_cartesian(xlim = c(0,45))

uni_qplot('total.sulfur.dioxide',5)+
  scale_x_continuous(breaks = seq(0,300,25))+
  coord_cartesian(xlim = c(0,175))

uni_qplot('density',0.0005)+
  scale_x_continuous(breaks = seq(0.99,1.0025,0.0025))

uni_qplot('pH', 0.02)+
  scale_x_continuous(breaks = seq(0,4.5,0.1))

uni_qplot('sulphates',0.05)+
  scale_x_continuous(breaks = seq(0,2,0.25))+
  coord_cartesian(xlim = c(0.25,1.25))

uni_qplot('alcohol',0.5)+
  scale_x_continuous(breaks = seq(8,15,1))

uni_qplot('quality',0.3)

Univariate Analysis

names(WINE)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "quality_lv"
str(WINE)
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality_lv          : Ord.factor w/ 3 levels "poor"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
summary(WINE)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality        quality_lv  
##  Min.   : 8.40   Min.   :3.000   poor   :  63  
##  1st Qu.: 9.50   1st Qu.:5.000   average:1319  
##  Median :10.20   Median :6.000   high   : 217  
##  Mean   :10.42   Mean   :5.636                 
##  3rd Qu.:11.10   3rd Qu.:6.000                 
##  Max.   :14.90   Max.   :8.000

What is the structure of your dataset?

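The WINE dataset contains 1599 red wines. Each wine is described by 11 physicochemical features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) plus a quality score. The data frame also holds an index column X and the derived variable quality_lv, for 14 variables in total.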

fixed.acidity: median 7.90, max 15.90; 75% of the red wines have a fixed.acidity of 9.20 or less.

volatile.acidity: median 0.52, max 1.58; 75% of the red wines have a volatile.acidity of 0.64 or less.

citric.acid: median 0.26, max 1.00; 75% of the red wines have a citric.acid of 0.42 or less.

residual.sugar: median 2.20, max 15.50; 75% of the red wines have a residual.sugar of 2.60 or less.

chlorides: median 0.079, max 0.611; 75% of the red wines have a chlorides value of 0.09 or less.

free.sulfur.dioxide: median 14.00, max 72.00; 75% of the red wines have a free.sulfur.dioxide of 21.00 or less.

total.sulfur.dioxide: median 38.00, max 289.00; 75% of the red wines have a total.sulfur.dioxide of 62.00 or less.

density: median 0.9968, max 1.0037; 75% of the red wines have a density of 0.9978 or less.

pH: median 3.31, max 4.01; 75% of the red wines have a pH of 3.40 or less.

sulphates: median 0.62, max 2.00; 75% of the red wines have a sulphates value of 0.73 or less.

alcohol: median 10.20, max 14.90; 75% of the red wines have an alcohol value of 11.10 or less.

quality: median 6.00, max 8.00; 75% of the red wines have a quality score of 6 or lower.
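The per-feature figures above can also be collected into one table instead of being read off `summary(WINE)`. The sketch below is illustrative: `feature_summary` is a hypothetical helper, and the sample data frame holds only the first ten pH and alcohol values printed by `str(WINE)`, so its numbers differ from the full-data values.

```r
# Hypothetical helper: median, 3rd quartile and maximum for each numeric column.
feature_summary <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  t(sapply(numeric_cols, function(x) {
    c(median = median(x, na.rm = TRUE),
      q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
      max    = max(x, na.rm = TRUE))
  }))
}

# Illustration only: the first ten rows shown by str(WINE) for two features
sample_rows <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)
feature_summary(sample_rows)
```

Calling `feature_summary(WINE)` would give one row per numeric column of the full dataset.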

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the quality of the red wine. I want to find which features of red wine matter most in determining its quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point, it’s hard to tell which features influence quality the most. However, I expect ‘fixed.acidity’, ‘citric.acid’, ‘pH’, and ‘alcohol’ to have more influence on quality than the rest of the features.

Did you create any new variables from existing variables in the dataset?

Yes. I created a new variable called ‘quality_lv’ to make it easier to detect different levels of quality. I also created a new variable called ‘pH.bucket’ to make it easier to see the distribution of pH for all the red wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates have long right tails, so I limited the x-axis of each plot to get a clearer view of where most of the wines fall.
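The axis limits in the plots above were picked by hand. One alternative, sketched here under the assumption that a fixed quantile is an acceptable cutoff, is to derive the limit from the data. `tail_limit` is a hypothetical helper and the toy vector is not wine data.

```r
# Hypothetical helper: upper axis limit at the p-th quantile of x.
tail_limit <- function(x, p = 0.95) {
  unname(quantile(x, probs = p, na.rm = TRUE))
}

# Illustration on a toy long-tailed vector (not the wine data)
toy <- c(rep(2, 18), 9, 15)
tail_limit(toy, 0.9)
```

The result can be passed to `coord_cartesian(xlim = c(0, tail_limit(WINE$residual.sugar)))`. `coord_cartesian()` only zooms the view, whereas `xlim()` and `ylim()` drop out-of-range rows before any statistics are computed.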

Bivariate Plots Section

box_dots_plot <- function(variable){
  ggplot(data=WINE,aes_q(x=~quality,y=as.name(variable)))+
    geom_boxplot()+
    geom_jitter(alpha=1/5)+
    geom_line(aes(group=1),
              stat = 'summary',
              fun.y=median,
              color='#E74C3C',
              size=1,
              alpha=0.8)
}
box_dots_plot('fixed.acidity')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$fixed.acidity)
## [1] 0.1240516
box_dots_plot('volatile.acidity')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$volatile.acidity)
## [1] -0.3905578
#As quality improves, volatile acidity decreases, so there is a negative relationship between volatile.acidity and quality.
box_dots_plot('citric.acid')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

group_by(WINE,quality) %>% 
  summarize(n_zero=sum(citric.acid==0)/n())
## # A tibble: 6 × 2
##   quality     n_zero
##     <int>      <dbl>
## 1       3 0.30000000
## 2       4 0.18867925
## 3       5 0.08370044
## 4       6 0.08463950
## 5       7 0.04020101
## 6       8 0.00000000
cor(x=WINE$quality,y=WINE$citric.acid)
## [1] 0.2263725
#As quality improves, the share of wines with zero citric acid decreases, so quality and citric acid have a positive relationship.
box_dots_plot('residual.sugar')+
  ylim(NA,quantile(WINE$residual.sugar,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 156 rows containing non-finite values (stat_boxplot).
## Warning: Removed 156 rows containing non-finite values (stat_summary).
## Warning: Removed 160 rows containing missing values (geom_point).

cor(x=WINE$quality,y=WINE$residual.sugar)
## [1] 0.01373164
#almost no apparent relationship between residual sugar and quality.
box_dots_plot('chlorides')+
  ylim(NA, quantile(WINE$chlorides,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 158 rows containing non-finite values (stat_boxplot).
## Warning: Removed 158 rows containing non-finite values (stat_summary).
## Warning: Removed 159 rows containing missing values (geom_point).

cor(x=WINE$quality,y=WINE$chlorides)
## [1] -0.1289066
#Weak relationship between chlorides and quality.
box_dots_plot('free.sulfur.dioxide')+
  geom_hline(yintercept = 50,color='#F1C40F',linetype=2, size=1.5)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$free.sulfur.dioxide)
## [1] -0.05065606
#no apparent relationship between free sulfur dioxide and quality
box_dots_plot('total.sulfur.dioxide')+
  ylim(NA,200)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing non-finite values (stat_summary).
## Warning: Removed 2 rows containing missing values (geom_point).

cor(x=WINE$quality,y=WINE$total.sulfur.dioxide)
## [1] -0.1851003
#Total sulfur dioxide is most concentrated around quality 5 and 6 and decreases as quality improves further, so there is a weak negative relationship between total sulfur dioxide and quality.
box_dots_plot('density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$density)
## [1] -0.1749192
#As the quality improves, density decreases gradually. Therefore, there is a negative relationship between density and quality.
box_dots_plot('pH')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$pH)
## [1] -0.05773139
#No apparent relationship between pH and quality.
box_dots_plot('sulphates')+
  ylim(NA,quantile(WINE$sulphates,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 150 rows containing non-finite values (stat_boxplot).
## Warning: Removed 150 rows containing non-finite values (stat_summary).
## Warning: Removed 157 rows containing missing values (geom_point).

cor(x=WINE$quality,y=WINE$sulphates)
## [1] 0.2513971
#As the quality improves, sulphates also increases. Therefore, there is a positive relationship between sulphates and quality.
box_dots_plot('alcohol')+
  xlab('Quality Level')+
  ylab('Alcohol')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

cor(x=WINE$quality,y=WINE$alcohol)
## [1] 0.4761663
#Of all the features, alcohol has the strongest correlation with quality.
#As quality improves, alcohol concentration increases too.
cor(x=WINE$density,y=WINE$alcohol)
## [1] -0.4961798
#Density is strongly related to alcohol.
#Lower density goes with higher alcohol concentration, and higher alcohol goes with higher quality, so lower density tends to go with higher quality (negative relationship).
cor(x=WINE$residual.sugar,y=WINE$density)
## [1] 0.3552834
#Besides alcohol, sugar content is also related to density: a higher residual sugar level goes with higher density (positive relationship). This does not carry over to quality, however: the correlation between residual sugar and quality computed above is only 0.014.
#Use boxplots to check the relationships above.
re_plot <- function(variable1, variable2){
  ggplot(data=WINE,aes_q(x=as.name(variable1),y=as.name(variable2)))+
    geom_boxplot()+
    geom_jitter(alpha=1/5)+
    geom_line(aes(group=1),
              stat = 'summary',
              fun.y=median,
              color='#E74C3C',
              size=1,
              alpha=0.8)
}
re_plot('quality_lv','alcohol')+
  xlab('Quality Level')+
  ylab('Alcohol')

#alcohol and quality have positive relationship
re_plot('alcohol','density')+
  xlim(NA,quantile(WINE$alcohol,0.9))+
  xlab('Alcohol')+
  ylab('Density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 141 rows containing non-finite values (stat_boxplot).
## Warning: Removed 141 rows containing non-finite values (stat_summary).
## Warning: Removed 149 rows containing missing values (geom_point).

#density and alcohol have negative relationship
re_plot('quality_lv','density')+
  xlab('Quality')+
  ylab('Density')

#density and quality have negative relationship
re_plot('residual.sugar','density')+
  xlab('Sugar')+
  ylab('Density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

#density and residual sugar have positive relationship
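
The repeated "Continuous x aesthetic" warning suggests that ggplot2 is treating the integer `quality` column as continuous rather than drawing one box per score. The following is a sketch of a variant that converts `quality` to a factor; it is not the code used for the plots above. The name `box_dots_plot_grouped` and the toy data are illustrative, and the sketch assumes ggplot2 3.4.0 or later (for the `fun` and `linewidth` arguments).

```r
library(ggplot2)

# Sketch of a variant of box_dots_plot(); names are illustrative.
# Assumes ggplot2 >= 3.4.0 (for `fun` and `linewidth`).
box_dots_plot_grouped <- function(data, variable) {
  ggplot(data, aes(x = factor(quality), y = .data[[variable]])) +
    geom_boxplot() +                      # one box per quality score
    geom_jitter(alpha = 1/5) +
    geom_line(aes(group = 1),             # connect the per-score medians
              stat = "summary",
              fun = median,
              color = "#E74C3C",
              linewidth = 1,
              alpha = 0.8) +
    xlab("quality")
}

# Toy data for illustration only (not the wine data)
toy <- data.frame(quality = c(5, 5, 6, 6, 7, 7),
                  alcohol = c(9.4, 9.8, 9.8, 10.5, 10.0, 11.2))
p <- box_dots_plot_grouped(toy, "alcohol")
```

With the real data the call would be `box_dots_plot_grouped(WINE, 'alcohol')`.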

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main feature I’m interested in is wine quality. The variables most clearly related to it are:

1. volatile.acidity: as quality improves, volatile acidity decreases, so there is a negative relationship between volatile.acidity and quality.

2. citric acid: as quality improves, the share of wines with zero citric acid decreases, so quality and citric acid have a positive relationship.

3. total sulfur dioxide: total sulfur dioxide is most concentrated around quality 5 and 6 and decreases as quality improves further, so there is a weak negative relationship between total sulfur dioxide and quality.

4. density: as quality improves, density decreases gradually, so there is a negative relationship between density and quality.

5. sulphates: as quality improves, sulphates also increase, so there is a positive relationship between sulphates and quality.

6. alcohol: alcohol has the strongest correlation with quality of all the features. As quality improves, alcohol concentration increases too, so they have a positive relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

1. Density is strongly related to alcohol (correlation -0.496). Lower density goes with higher alcohol concentration, and higher alcohol goes with higher quality, so lower density tends to go with higher quality.

2. Besides alcohol, sugar content is also related to density: a higher residual sugar level goes with higher density (correlation 0.355). This does not carry over to quality, however: the correlation between residual sugar and quality is only 0.014.

What was the strongest relationship you found?

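The strongest relationship involving wine quality is with alcohol concentration, with a correlation of 0.476: the higher the alcohol concentration, the higher the wine’s quality tends to be. Among all the pairs examined in this section, the relationship between density and alcohol is slightly stronger in magnitude, with a correlation of -0.496.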
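Rather than calling `cor()` once per feature, the correlations with quality can be ranked in one step. The sketch below is illustrative: `rank_by_correlation` is a hypothetical helper, and the sample data frame contains only the first ten rows printed by `str(WINE)`, which are far too few to reproduce the full-data correlations.

```r
# Hypothetical helper: correlation of every other numeric column with `target`,
# sorted by absolute value (strongest first).
rank_by_correlation <- function(df, target) {
  numeric_cols <- df[sapply(df, is.numeric)]
  r <- cor(numeric_cols)[, target]
  r <- r[names(r) != target]
  r[order(abs(r), decreasing = TRUE)]
}

# Illustration only: the first ten rows shown by str(WINE)
sample_rows <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
rank_by_correlation(sample_rows, "quality")
```

On the full data, `rank_by_correlation(WINE, 'quality')` would list every numeric column, including the index column `X`, which should be dropped first.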

Multivariate Plots Section

ggplot(aes(x=volatile.acidity, y=alcohol, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$volatile.acidity,y=WINE$quality)
## [1] -0.3905578
ggplot(aes(x=volatile.acidity,color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#Higher quality wine has a lower volatile acidity level. Most high quality wines have volatile acidity around 0.4, most average quality wines around 0.5, and most poor quality wines around 0.7.
ggplot(aes(x=fixed.acidity, y=alcohol, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$fixed.acidity,y=WINE$quality)
## [1] 0.1240516
ggplot(aes(x=fixed.acidity, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#Higher quality wine tends to have a higher level of fixed acidity. Most poor and average quality wines have fixed acidity around 6-7, while most high quality wines have fixed acidity around 9-10.
ggplot(aes(x=volatile.acidity, y=fixed.acidity, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$volatile.acidity,y=WINE$fixed.acidity)
## [1] -0.2561309
#The graph above is consistent with the first two points: wine quality has a positive relationship with fixed acidity and a negative relationship with volatile acidity.
ggplot(aes(x=citric.acid, y=alcohol, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor.test(x=WINE$citric.acid, y=WINE$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  WINE$citric.acid and WINE$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
ggplot(aes(x=citric.acid, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#higher quality wine has higher level of citric acid. 
ggplot(aes(x=residual.sugar, y=alcohol, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$residual.sugar,y=WINE$quality)
## [1] 0.01373164
ggplot(aes(x=residual.sugar, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#There is almost no correlation between quality and the residual sugar level.
ggplot(aes(x=chlorides, y=alcohol, color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$chlorides,y=WINE$quality)
## [1] -0.1289066
ggplot(aes(x=chlorides,color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#In general, higher quality wine has lower level of chlorides, although the correlation is very weak.
ggplot(aes(x=free.sulfur.dioxide, y=alcohol,color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$free.sulfur.dioxide,y=WINE$quality)
## [1] -0.05065606
ggplot(aes(x=free.sulfur.dioxide, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#There is almost no correlation between quality and the concentration of free sulfur dioxide.
ggplot(aes(x=total.sulfur.dioxide,y=alcohol,color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$total.sulfur.dioxide,y=WINE$quality)
## [1] -0.1851003
ggplot(aes(x=total.sulfur.dioxide,color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#In general, higher quality wine has lower level of total sulfur dioxide, although the correlation is very weak.
ggplot(aes(x=density,y=alcohol,color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol = 3)

cor(x=WINE$density,y=WINE$quality)
## [1] -0.1749192
ggplot(aes(x=density,color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#In general, higher quality wine has lower density. This does not hold between poor and average quality wines, but high quality wines do have lower density overall.
ggplot(aes(x=pH,y=alcohol,color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol = 3)

cor(x=WINE$pH,y=WINE$quality)
## [1] -0.05773139
ggplot(aes(x=pH, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#There is almost no correlation between pH and wine quality.
ggplot(aes(x=sulphates,y=alcohol,color=quality_lv),data=WINE)+
  geom_point()+
  facet_wrap(~quality_lv,ncol=3)

cor(x=WINE$sulphates,y=WINE$quality)
## [1] 0.2513971
ggplot(aes(x=sulphates, color=quality_lv),data=WINE)+
  geom_density()+
  theme_classic()

#In general, wine with higher quality has higher level of sulphates concentration.
ggplot(aes(x=fixed.acidity+volatile.acidity+citric.acid,y=pH),
       data=WINE)+
  geom_point(alpha=0.2)+
  geom_smooth(method = 'loess', color='red')

cor(x=WINE$fixed.acidity+WINE$volatile.acidity+WINE$citric.acid,y=WINE$pH)
## [1] -0.6834838
#As the graph above shows, pH gets lower as the overall acid concentration gets higher.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Higher amounts of fixed acidity, citric acid, or sulphates, along with higher alcohol concentration, go with better quality wines.

Lower concentration of volatile acidity along with higher alcohol concentration also goes with better quality wines.

Were there any interesting or surprising interactions between features?

pH is closely tied to the combined concentration of three acids: fixed acidity, volatile acidity, and citric acid. The correlation between their sum and pH is -0.683.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I built three linear models to analyse the relationships between wine quality and the other features. The first linear model is designed for analysing each feature of the wine. The second one shows the relationship between quality and each feature. The third one verifies the conclusion drawn from the second linear model.
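The three models themselves are not shown in this report. As a minimal sketch of what one such model might look like, the code below fits quality on the two features that correlate most strongly with it. The model formula is an assumption, and the sample data frame contains only the first ten rows printed by `str(WINE)`, so the fitted coefficients are illustrative only.

```r
# Illustration only: the first ten rows shown by str(WINE)
sample_rows <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed model form: quality as a linear function of the two features
fit <- lm(quality ~ alcohol + volatile.acidity, data = sample_rows)
coef(fit)               # intercept and one slope per predictor
summary(fit)$r.squared  # share of variance explained by the model
```

On the full data the call would be `lm(quality ~ alcohol + volatile.acidity, data = WINE)`. Since quality is an ordinal score, a linear model is only an approximation.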

Final Plots and Summary

Plot One

Description One

In the previous analysis of which feature is most influential for wine quality, alcohol stands out with the strongest positive correlation with quality. From the boxplot above, we can conclude that the more alcohol a wine has, the better its quality tends to be.

Plot Two

Description Two

Besides alcohol, volatile acidity is the second most influential feature for wine quality. From the previous analysis and the boxplot above, we can conclude that a higher level of volatile acidity goes with lower wine quality.

Plot Three

Description Three

From the plot above, we can see the combined effect of alcohol and volatile acidity on wine quality: wines with more alcohol and less volatile acidity generally have better quality, while wines with less alcohol and more volatile acidity mostly have lower quality ratings.


Reflection

The red wine dataset has 1599 samples and 11 features. To analyze which features matter for wine quality, I built 3 linear models to explore those features. Coloring the wines by quality level in the multivariate plots made it much clearer and easier to interpret what kind of influence each feature has on quality.

From the analysis, alcohol and volatile acidity are the two most important features for wine quality. Quality has a positive relationship with alcohol and a negative relationship with volatile acidity.

For a better analysis, it would help if the dataset included more wines of poor or high quality, which would make conclusions about those groups more reliable.
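
The imbalance between quality levels can be quantified from the counts in the `summary(WINE)` output. This is a small sketch; the vector name `level_counts` is illustrative.

```r
# Counts per quality level, copied from the summary(WINE) output
level_counts <- c(poor = 63, average = 1319, high = 217)

# Share of each level among the 1599 wines
level_share <- level_counts / sum(level_counts)
round(level_share, 3)
```

About 82% of the wines fall in the ‘average’ level, leaving roughly 4% ‘poor’ and 14% ‘high’.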